
    Language run-time systems: An overview

    The proliferation of high-level programming languages with advanced language features and the need for portability across increasingly heterogeneous and hierarchical architectures require a sophisticated run-time system to manage program execution and available resources. Additional benefits include isolated execution of untrusted code and the potential for dynamic optimisation, among others. This paper provides a high-level overview of language run-time systems with a focus on execution models, support for concurrency and parallelism, memory management, and communication, whilst briefly mentioning synchronisation, monitoring, and adaptive policy control. Two alternative approaches to run-time system design are presented and several challenges for future research are outlined. References to both seminal and recent work are provided.

    Adaptive architecture-transparent policy control in a distributed graph reducer

    The end of the frequency scaling era occurred around 2005, as clock frequencies stalled for commodity architectures. Performance improvements that could previously be expected with each new hardware generation therefore had to originate elsewhere. Almost all computer architectures exhibit substantial and growing levels of parallelism, and exploiting it became one of the key sources of performance and scalability improvements. Alas, parallel programming proved much more difficult than sequential programming, owing to the need to specify coordination and parallelism management. Whilst low-level languages place this burden on the programmer, reducing productivity and portability, semi-implicit approaches delegate the responsibility to sophisticated compilers and run-time systems.

    This thesis presents a study of adaptive load distribution based on work stealing, using history and ancestry information, in a distributed graph reducer for a non-strict functional language. The results contribute to the exploration of more flexible run-time-system-level parallelism control implementing a semi-explicit model of parallelism, which offers productivity and a high level of abstraction by delegating the responsibility for coordination to the run-time system.

    After characterising a set of parallel functional applications, we study the use of historical information to adapt the choice of victim in a work stealing scheduler. We observe substantially lower numbers of messages for data-parallel and nested applications. However, this heuristic fails where past application behaviour does not resemble future behaviour, for instance for Divide-and-Conquer applications with a large number of very fine-grained threads and with generators of parallelism that move dynamically across processing elements (PEs). The mechanism is not specific to the language or the run-time system, and applies to other work stealing schedulers.

    Next, we focus on the other key work stealing decision: which sparks, representing potential parallelism, to donate. We investigate the effect of Spark Colocation on the performance of five Divide-and-Conquer programs run on a cluster of up to 256 PEs. Under Spark Colocation, the distributed graph reducer shares related work, resulting in a higher degree of both potential and actual parallelism, and in more fine-grained and less variable thread sizes. We validate this behaviour by observing a reduction in average fetch times, albeit with increased numbers of FETCH messages and inter-PE pointers, which nevertheless results in improved load balance for three of the five benchmark programs. The results show high speedups and speedup improvements under Spark Colocation for the three more regular and nested applications, and performance degradation for two programs: one that is excessively fine-grained and one that exhibits limited scalability. Overall, Spark Colocation appears most beneficial at higher numbers of PEs, where improved load balance and a higher degree of parallelism have more opportunity to pay off.

    In more general terms, we show that a run-time system can beneficially use historical information on past stealing successes, gathered dynamically and used within the same run, together with ancestry information reconstructed at run time from annotations. Moreover, the results support the view that different heuristics benefit applications with different parallelism patterns, underlining the advantages of a flexible, architecture-transparent approach.

    The Scottish Informatics and Computer Science Alliance (SICSA)
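
    To make the Spark Colocation idea above concrete, a minimal sketch follows; the thesis's actual mechanism lives inside a distributed graph reducer for Haskell, so the ancestry encoding, the prefix-matching rule, and all names here are illustrative assumptions. The donor answers a steal request by preferring sparks whose dynamically reconstructed ancestry relates them to work already sent to the same thief, so that related graph nodes end up colocated.

        from dataclasses import dataclass

        @dataclass
        class Spark:
            work: object            # suspended computation (opaque here)
            ancestry: tuple = ()    # path of annotations from the root task

        def related(spark, sent_ancestries):
            # A spark counts as 'related' if its ancestry path shares a
            # prefix with something previously donated to the same thief.
            a = spark.ancestry
            return any(a[:len(b)] == b or b[:len(a)] == a for b in sent_ancestries)

        def pick_spark_to_donate(spark_pool, thief, sent_to):
            # sent_to maps each thief PE to the ancestry paths it has received.
            for i, spark in enumerate(spark_pool):
                if related(spark, sent_to.get(thief, [])):
                    sent_to.setdefault(thief, []).append(spark.ancestry)
                    return spark_pool.pop(i)
            # Fall back to the default policy (here: the oldest spark)
            # when no related spark is available.
            if spark_pool:
                spark = spark_pool.pop(0)
                sent_to.setdefault(thief, []).append(spark.ancestry)
                return spark
            return None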

    Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

    There is increasing interest in the use of HPC machines for urgent workloads to help tackle disasters as they unfold. Whilst batch queue systems are not ideal for supporting such workloads, many of their disadvantages can be worked around by accurately predicting when a waiting job will start to run. However, there are numerous challenges in achieving such predictions with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behaviour resulting from the queue policy and other interactions, and so generate accurate job start times. For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600) and 4-cabinet (HPE Cray EX) we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimates generated by Slurm. We demonstrate that our techniques deliver the most accurate predictions across the machines of interest: job start times are predicted within one minute of the actual start time for around 65% of jobs on ARCHER2 and 4-cabinet, and for 76% of jobs on Cirrus. Compared against what Slurm can deliver, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better on Cirrus. Furthermore, our approach accurately predicts the start time within ten minutes of the actual start time for three quarters of all jobs on ARCHER2 and 4-cabinet, and for 90% of jobs on Cirrus. Whilst the driver of this work has been to better facilitate the placement of urgent workloads across HPC machines, the insights gained can provide wider benefits to users, enrich existing batch queue systems, and inform policy.

    Comment: Preprint of an article at the 2022 Cray User Group (CUG)
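
    As a rough sketch of how such a prediction can be framed (the feature names, the input file, and the choice of gradient boosting are illustrative assumptions; the paper itself compares several machine learning techniques), one can treat it as supervised regression over historical queue snapshots:

        import pandas as pd
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import train_test_split

        # Hypothetical historical accounting data: one row per completed
        # job, with the queue state captured at submission time.
        jobs = pd.read_csv("queue_history.csv")
        features = ["nodes_requested", "walltime_requested", "jobs_ahead_in_queue",
                    "nodes_free_at_submit", "partition_load", "user_recent_usage"]
        X, y = jobs[features], jobs["actual_wait_seconds"]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        model = GradientBoostingRegressor().fit(X_train, y_train)

        # Predict the wait time for one held-out job's queue snapshot and
        # compare it against the known answer.
        predicted = model.predict(X_test.iloc[[0]])[0]
        print(f"predicted start in {predicted / 60:.1f} min, "
              f"actual {y_test.iloc[0] / 60:.1f} min")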

    History-based adaptive work distribution

    Exploiting the parallelism of increasingly heterogeneous parallel architectures is challenging due to the complexity of parallelism management. To achieve high performance portability whilst preserving high productivity, high-level approaches to parallel programming delegate parallelism management, such as partitioning and work distribution, to the compiler and the run-time system. Random work stealing has proved efficient for well-structured workloads, but neglects potentially useful context information that can be obtained through static analysis or run-time monitoring and used to improve load balancing, especially for irregular applications with highly varying thread granularity and thread creation patterns. We investigate the effectiveness of an adaptive work distribution scheme to improve load balancing for an extension of Haskell that provides a deterministic parallel programming model and supports both shared-memory and distributed-memory architectures. The scheme uses a less random form of work stealing that takes into account information on past stealing successes and failures. We quantify run-time performance, communication overhead, and stealing success for four divide-and-conquer and data-parallel applications and three different update intervals on a commodity 64-core Beowulf cluster of multi-cores.
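
    A minimal sketch of such "less random" stealing follows; the weighting rule, parameters, and names are illustrative assumptions rather than the actual scheduler logic. Each processing element keeps a weight per peer, rewards peers from which steals recently succeeded, penalises failures, and periodically relaxes the weights so that stale history fades out, mirroring the role of the update interval studied above.

        import random

        class HistoryAwareStealer:
            def __init__(self, peers, update_interval=100, decay=0.5):
                self.weights = {pe: 1.0 for pe in peers}
                self.update_interval = update_interval
                self.decay = decay
                self.attempts = 0

            def choose_victim(self):
                # Weighted random choice: still stochastic, but biased
                # towards peers whose recent steal attempts succeeded.
                peers = list(self.weights)
                return random.choices(peers, [self.weights[p] for p in peers])[0]

            def record(self, victim, success):
                # Reward successful steals, penalise failed ones, keeping
                # every weight positive so no peer is ruled out entirely.
                self.weights[victim] = max(0.1, self.weights[victim] + (1.0 if success else -0.5))
                self.attempts += 1
                if self.attempts % self.update_interval == 0:
                    # Periodically decay all weights towards neutral, so
                    # stale history does not dominate future choices.
                    for pe in self.weights:
                        self.weights[pe] = 1.0 + self.decay * (self.weights[pe] - 1.0)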

    Architecture-Aware Cost Modelling for Parallel Performance Portability

    Languages for efficient parallel programming need to achieve high performance portability in order to harness the power offered by rapidly evolving parallel architectures. We use a combination of high-level architecture-aware cost modelling and low-level, explicit control of coordination as a programming model to improve performance portability. We explore and quantify the impact of heterogeneity in modern parallel architectures on the performance of parallel programs, using a range of clusters of multi-cores varying in architectural parameters such as processor speed, memory size and interconnection speed. Additionally, we develop several formal cost models that automatically use these architectural characteristics to determine suitable granularity and work placement. The effectiveness of such cost-model-driven management of parallelism on commonplace cluster hardware is demonstrated by measuring the performance of a parallel sparse matrix multiplication, implemented in C+MPI, on a range of heterogeneous architectures. On a cluster with 16 cores, the speedup increases from 6.2 without any cost model to 9.1, indicating that even a simple, static cost model is effective in adapting the execution to the target architecture and in significantly improving parallel performance and scalability with negligible overhead.
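
    The flavour of such a static cost model can be sketched as follows; the formula and parameters are illustrative assumptions, not the paper's actual models. Each PE's share of the matrix rows is chosen in proportion to its measured relative speed, and a crude compute-plus-communication estimate predicts the resulting execution time.

        def partition_rows(n_rows, speeds):
            # Give each PE a share of the rows proportional to its
            # relative speed, so faster nodes receive more work.
            total = sum(speeds)
            shares = [round(n_rows * s / total) for s in speeds]
            shares[-1] += n_rows - sum(shares)  # absorb rounding error
            return shares

        def predicted_time(rows, nnz_per_row, flops_per_sec, bytes_per_row, bandwidth):
            # Crude cost model: compute time plus communication time.
            return rows * nnz_per_row / flops_per_sec + rows * bytes_per_row / bandwidth

        # Example: a heterogeneous 4-node cluster with relative speeds 1:1:2:4.
        print(partition_rows(10_000, [1.0, 1.0, 2.0, 4.0]))   # [1250, 1250, 2500, 5000]
        # Estimated time for one node's share (all numbers made up).
        print(f"{predicted_time(5000, 30, 1e9, 8000, 1e8):.3f} s")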

    Workflows to Drive High-Performance Interactive Supercomputing for Urgent Decision Making

    Interactive urgent computing is a small but growing user of supercomputing resources. However, there are numerous technical challenges that must be overcome to make supercomputers fully suited to the wide range of urgent workloads that could benefit from the computational power delivered by such instruments. An important question is how to connect the different components of an urgent workload, namely the users, the simulation codes, and external data sources, in a structured and accessible manner. In this paper we explore the role of workflows both in marshalling and controlling urgent workloads and at the level of the individual HPC machine, which ultimately requires two workflow systems. Using a space weather prediction urgent use-case, we explore the benefits that these two workflow systems provide, especially when one exploits the flexibility enabled by their interoperation.

    Utilising urgent computing to tackle the spread of mosquito-borne diseases

    It is estimated that around 80% of the world's population live in areas susceptible to at least one major vector-borne disease, and approximately 20% of global communicable diseases are spread by mosquitoes. Furthermore, outbreaks of such diseases are becoming more common and widespread, driven in recent years largely by socio-demographic and climatic factors. These trends are causing significant concern to global health organisations, including the CDC and WHO, and so an important question is the role that technology can play in addressing them. In this work we describe the integration of an epidemiology model, which simulates the spread of mosquito-borne diseases, with the VESTEC urgent computing ecosystem. The intention of this work is to empower human health professionals to exploit this model and more easily explore the progression of mosquito-borne diseases. For a model that has traditionally been the domain of a few research scientists, we demonstrate the significant advantages that such an integration can provide, leveraging state-of-the-art visualisation and analytics techniques, all supported by running the computational workloads on HPC machines in a seamless fashion. Furthermore, we demonstrate the benefits of using an ecosystem such as VESTEC, which provides a framework for urgent computing, in supporting the easier adoption of these technologies by epidemiologists and disaster response professionals more widely.
